This course provides a solid foundation in open-source data science workflows.
Link to my GitHub repository:
https://github.com/MikkoPatronen/IODS-project
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The data is a subset of the variables from the dataset “International survey of Approaches to Learning”. It consists of 166 rows (observations) and 7 columns (variables): “gender”, “age”, “attitude”, “deep”, “stra”, “surf” and “points”. The variables “gender” (F = female, M = male) and “age” (in years) are self-explanatory. The variable “attitude” describes global attitude toward statistics (a sum of ten items). The variable “deep” describes the deep learning approach (a mean of three variables), “stra” the strategic approach (a mean of two variables), and “surf” the surface approach (a mean of three variables). The variable “points” is the exam points.
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :14.00 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :32.00 Median :3.667 Median :3.188
## Mean :25.51 Mean :31.43 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :50.00 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
There are almost twice as many answers from females (n = 110) as from males (n = 56). The participants’ ages range from 17 to 55 years, and half of them are at least 22 years old.
I chose the variables “attitude”, “stra” and “surf” as explanatory variables for the target variable “points”. Here is the summary for the fitted model:
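The model behind this summary can be fitted with `lm()`; a minimal sketch (the data frame name `learning2014` is taken from the Call line of the output):

```r
# regress exam points on attitude and the strategic and surface approaches
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(my_model)
```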
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
According to the model, the variables “stra” and “surf” do not have a statistically significant relationship with the target (p-values 0.117 and 0.466). Since “surf” has the larger p-value, let us remove it and fit a new model without it. Here is the new fitted model summary:
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
The new summary shows that the variable “stra” still does not have a statistically significant relationship with the target (p = 0.089), so I create a final model without it. Here is the final model summary:
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The model assumes a linear relationship between attitude toward statistics and exam points. The explanatory variable “attitude” has a statistically significant relationship with the target variable “points” (p ≈ 4.1e-09, i.e. almost 0).
The model shows a y-intercept of 11.637 and a positive slope of 0.353: a one-point increase in attitude is associated with a 0.353-point increase in exam points. This seems reasonable.
Multiple R-squared can be interpreted as the proportion of variance in the target variable explained by the explanatory variable(s). In this model, attitude explains about 19 percent of the variance in exam points.
The model also assumes that the residuals are normally distributed. A residual is the difference between an observed value and the value the model predicts at that point.
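These assumptions can be checked graphically with the default diagnostic plots of `lm`; a sketch (the object name `my_model` is an assumption, not from the original text):

```r
# fit the final model and draw Residuals vs Fitted, Normal Q-Q,
# and Residuals vs Leverage plots
my_model <- lm(points ~ attitude, data = learning2014)
plot(my_model, which = c(1, 2, 5))
```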
The data consists of two datasets retrieved from the UCI Machine Learning Repository (Student Performance Data Set, https://archive.ics.uci.edu/ml/datasets/Student+Performance). The attributes include student grades and demographic, social, and school-related features, collected using school reports and questionnaires. The two datasets cover performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). The two were combined for this analysis.
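One plausible way to combine the two tables is an inner join on the background variables that identify a student. This is a sketch under the assumptions that the raw UCI files are named `student-mat.csv` and `student-por.csv` and that the listed columns jointly identify a student:

```r
library(dplyr)

math <- read.csv("student-mat.csv", sep = ";", header = TRUE)
por  <- read.csv("student-por.csv", sep = ";", header = TRUE)

# background columns assumed to identify the same student in both tables
join_cols <- c("school", "sex", "age", "address", "famsize", "Pstatus",
               "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet")

# keep only students that appear in both courses
math_por <- inner_join(math, por, by = join_cols, suffix = c(".math", ".por"))
```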
The data consists of 382 observations (students) and 35 variables. The 35 column names are as follows:
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
Two variables describe the student’s alcohol use. The variable “alc_use” is the average alcohol consumption across weekdays and weekends, and the variable “high_use” indicates whether that average exceeds 2.
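A sketch of how these two variables could be derived from the weekday (Dalc) and weekend (Walc) consumption columns listed above (the data frame name `alc` is an assumption):

```r
library(dplyr)

alc <- alc %>%
  mutate(alc_use  = (Dalc + Walc) / 2,  # average of weekday and weekend use
         high_use = alc_use > 2)        # TRUE when the average exceeds 2
```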
I chose four variables that I believe are related to alcohol consumption: “sex”, “studytime”, “activities” and “romantic”. Gender (“sex”), extra-curricular activities (“activities”) and relationship status (“romantic”) are binary variables. Weekly study time (“studytime”) is coded as four ordered categories (1–4), ranging from under 2 hours to over 10 hours per week. My hypothesis is that males, those without extra-curricular activities, those not in a romantic relationship, and those with less weekly study time are more likely to be high users of alcohol.
I think barplots will describe the relationships quite well. The barplot below shows the distribution of mean alcohol consumption by gender. It clearly shows that males form the majority at higher consumption means, especially at values of 3 or above. The grouped summary gives the numbers: 72 males versus 42 females in the high alcohol use group. This supports the hypothesis.
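The grouped count shown below can be produced with a dplyr summary; a sketch (the data frame name `alc2` is taken from the later glm call):

```r
library(dplyr)

# count observations in each sex / high_use combination
alc2 %>% group_by(sex, high_use) %>% summarise(count = n())
```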
## # A tibble: 4 x 3
## # Groups: sex [?]
## sex high_use count
## <fct> <lgl> <int>
## 1 F FALSE 156
## 2 F TRUE 42
## 3 M FALSE 112
## 4 M TRUE 72
The next barplot shows mean alcohol consumption by the binary variable “activities”, which indicates whether the individual has extra-curricular activities. This plot is not as clear as the previous one, but those without extra-curricular activities form a slight majority at mean values above 2. The grouped summary shows 59 “no” and 55 “yes” answers in the high alcohol use group. This supports the hypothesis, but the difference is quite small.
## # A tibble: 4 x 3
## # Groups: activities [?]
## activities high_use count
## <fct> <lgl> <int>
## 1 no FALSE 122
## 2 no TRUE 59
## 3 yes FALSE 146
## 4 yes TRUE 55
The next barplot shows mean alcohol consumption by the binary variable “romantic”, which indicates whether the individual is in a romantic relationship. This plot clearly supports the hypothesis: those not in a romantic relationship form the majority at mean values of 2 and higher. The grouped summary is equally clear: the high alcohol use group contains 81 individuals who are not in a romantic relationship but only 33 who are.
## # A tibble: 4 x 3
## # Groups: romantic [?]
## romantic high_use count
## <fct> <lgl> <int>
## 1 no FALSE 180
## 2 no TRUE 81
## 3 yes FALSE 88
## 4 yes TRUE 33
Finally, a boxplot also supports the hypothesis: those with less weekly study time form the majority of high alcohol users. The grouped summary likewise shows that as study time decreases, the number of individuals with high alcohol use increases.
## # A tibble: 8 x 3
## # Groups: studytime [?]
## studytime high_use count
## <int> <lgl> <int>
## 1 1 FALSE 58
## 2 1 TRUE 42
## 3 2 FALSE 135
## 4 2 TRUE 60
## 5 3 FALSE 52
## 6 3 TRUE 8
## 7 4 FALSE 23
## 8 4 TRUE 4
Next I will conduct a logistic regression analysis to study the relationships between the independent variables (“sex”, “studytime”, “activities” and “romantic”) and the dependent variable (high_use: the mean alcohol use > 2). Here is the summary of the analysis:
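The model was fitted with `glm()`; a sketch reproducing the Call shown in the output (the object name `m` is an assumption):

```r
# logistic regression of high_use on the four chosen variables
m <- glm(high_use ~ sex + romantic + activities + studytime,
         family = "binomial", data = alc2)
summary(m)
```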
##
## Call:
## glm(formula = high_use ~ sex + romantic + activities + studytime,
## family = "binomial", data = alc2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2083 -0.8872 -0.6806 1.1819 2.0591
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.17051 0.38376 -0.444 0.65681
## sexM 0.69839 0.24552 2.845 0.00445 **
## romanticyes -0.08315 0.25272 -0.329 0.74213
## activitiesyes -0.26321 0.23571 -1.117 0.26414
## studytime -0.45540 0.15754 -2.891 0.00384 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 439.32 on 377 degrees of freedom
## AIC: 449.32
##
## Number of Fisher Scoring iterations: 4
## (Intercept) sexM romanticyes activitiesyes studytime
## -0.1705116 0.6983944 -0.0831546 -0.2632122 -0.4553951
From the summary we can see that gender and study time both have statistically significant estimates (p = 0.00445 and 0.00384, respectively). The coefficients “romanticyes” and “activitiesyes” compare the level “yes” against the reference level “no” (R uses the first factor level as the baseline), and neither is statistically significant (p = 0.742 and 0.264), so I drop those two variables from the model. So here is the new model:
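As an aside (not used in the model below): if a different baseline were wanted, `relevel()` changes a factor’s reference level. A sketch:

```r
# make "yes" the baseline so the reported coefficients are for "no" vs "yes"
alc2$romantic   <- relevel(alc2$romantic, ref = "yes")
alc2$activities <- relevel(alc2$activities, ref = "yes")
```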
##
## Call:
## glm(formula = high_use ~ sex + studytime, family = "binomial",
## data = alc2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1429 -0.8820 -0.7177 1.2123 2.1421
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2621 0.3726 -0.704 0.48169
## sexM 0.6618 0.2413 2.743 0.00610 **
## studytime -0.4815 0.1566 -3.075 0.00211 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 440.71 on 379 degrees of freedom
## AIC: 446.71
##
## Number of Fisher Scoring iterations: 4
## (Intercept) sexM studytime
## -0.2621283 0.6618371 -0.4814874
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.7694123 0.3707640 1.6029554
## sexM 1.9383500 1.2105015 3.1228545
## studytime 0.6178637 0.4502538 0.8331592
This model also gives statistically significant estimates for gender and study time. The odds ratios can be interpreted so that the odds of high alcohol use are 1.94 times higher for males than for females (95% confidence interval 1.21–3.12), and each one-category increase in study time is associated with lower odds of high use (OR 0.62, 95% confidence interval 0.45–0.83). Both interpretations support the original hypothesis: males are more likely to be high users of alcohol, and less study time is associated with higher alcohol use.
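The odds ratios and confidence intervals above come from exponentiating the model coefficients; a sketch (the object name `m2` is an assumption):

```r
m2 <- glm(high_use ~ sex + studytime, family = "binomial", data = alc2)
OR <- exp(coef(m2))     # odds ratios
CI <- exp(confint(m2))  # profile-likelihood 95% CIs ("Waiting for profiling...")
cbind(OR, CI)
```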
This week we are using a dataset called “Boston” (included in R’s MASS package), which contains variables for studying housing values in the suburbs of Boston. According to the MASS package documentation (`?Boston`), the variables are:
- crim: per capita crime rate by town
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- nox: nitrogen oxides concentration (parts per 10 million)
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted mean of distances to five Boston employment centres
- rad: index of accessibility to radial highways
- tax: full-value property-tax rate per $10,000
- ptratio: pupil-teacher ratio by town
- black: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
- lstat: lower status of the population (percent)
- medv: median value of owner-occupied homes in $1000s
The structure and dimensions of the data are presented here:
data(Boston)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
As we can see, the dimensions of Boston data are 506 rows and 14 columns. Here is a summary of the variables:
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Now, let’s print correlation coefficients between variables and a graphical overview of those correlations in the data:
cor_matrix<-cor(Boston) %>% round(digits = 2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
In the correlation matrix the colors indicate the direction of the correlation (red is negative, blue is positive) and darker colors indicate stronger correlations. Based on this plot, one variable pair has a particularly strong positive correlation: “rad” and “tax”, r = 0.91. The variable “dis” has strong negative correlations with “age” (r = -0.75), “nox” (r = -0.77) and “indus” (r = -0.71), and a strong negative correlation is also seen between “lstat” and “medv” (r = -0.74). The mentioned variables are:
- dis: weighted mean of distances to five Boston employment centres
- rad: index of accessibility to radial highways
- tax: full-value property-tax rate per $10,000
- nox: nitrogen oxides concentration (parts per 10 million)
- indus: proportion of non-retail business acres per town
- lstat: lower status of the population (percent)
- medv: median value of owner-occupied homes in $1000s
Let’s standardize the dataset and print out summaries of the scaled data:
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
We can see that after standardizing the mean of each variable is zero.
# convert the scaled matrix to a data frame (scale() returns a matrix,
# which '$' and dplyr cannot subset)
boston_scaled <- as.data.frame(boston_scaled)
# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
n <- nrow(boston_scaled)
# choose randomly 80% of the rows (a set.seed() call beforehand would make the split reproducible)
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2623762 0.2326733 0.2450495 0.2599010
##
## Group means:
## zn indus chas nox rm
## low 0.9667397 -0.9048142 -0.08661679 -0.8841363 0.4182980
## med_low -0.1200477 -0.2375617 0.06274329 -0.5527046 -0.1847747
## med_high -0.3815632 0.1759325 0.24466389 0.3594167 0.1659049
## high -0.4872402 1.0170492 -0.12234430 1.0391104 -0.4700333
## age dis rad tax ptratio
## low -0.9051405 0.8536728 -0.6839198 -0.7271865 -0.46620442
## med_low -0.3211803 0.3073077 -0.5615811 -0.4277290 -0.02221155
## med_high 0.3981304 -0.3346304 -0.4099578 -0.3007148 -0.24727254
## high 0.8213257 -0.8610618 1.6388211 1.5145512 0.78158339
## black lstat medv
## low 0.37173893 -0.75652658 0.50428356
## med_low 0.31116278 -0.07786545 -0.06868928
## med_high 0.07241503 -0.06921595 0.21521067
## high -0.74650664 0.93571304 -0.71642081
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.10154812 0.69450191 -0.83101653
## indus 0.01113254 -0.29086989 0.19031801
## chas -0.09058595 -0.04225589 0.16227843
## nox 0.42164075 -0.77586108 -1.31015820
## rm -0.10292095 -0.09341653 -0.10493212
## age 0.23308805 -0.36306728 -0.20887145
## dis -0.03544963 -0.33367468 0.02160288
## rad 3.17836534 0.94673592 -0.38234461
## tax -0.02397136 0.01918269 0.82783519
## ptratio 0.10163743 -0.08284321 -0.17268339
## black -0.12698479 0.05791888 0.14404636
## lstat 0.21316334 -0.17107222 0.45291571
## medv 0.16760263 -0.45910511 -0.21308796
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9480 0.0386 0.0134
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
You can check out my R script for the Data wrangling part here.
This week’s tasks present dimensionality reduction techniques. Principal Component Analysis (PCA) will be applied.
The R script file that created the data file can be viewed here. Let us view information about the data:
rm(list=ls())
library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)
library(MASS)
library(corrplot)
human <- read.table("human.txt")
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 0.198 0.931 0.861 0.977 0.989 ...
## $ Labo.FM : num 0.199 0.685 0.211 0.633 0.747 ...
## $ Edu.Exp : num 9.3 11.8 14 17.9 12.3 20.2 15.7 11.9 12.6 14.4 ...
## $ Life.Exp : num 60.4 77.8 74.8 76.3 74.7 82.4 81.4 70.8 75.4 76.6 ...
## $ GNI : int 1885 9943 13054 22050 8124 42261 43869 16428 21336 38599 ...
## $ Mat.Mor : int 400 21 89 69 29 6 4 26 37 22 ...
## $ Ado.Birth: num 86.8 15.3 10 54.4 27.1 12.1 4.1 40 28.5 13.8 ...
## $ Parli.F : num 27.6 20.7 25.7 36.8 10.7 30.5 30.3 15.6 16.7 15 ...
The data includes 155 observations of 8 variables. They are:
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
ggpairs(human)
Since the variables are not standardized, their scales and means differ widely (for example, GNI is in the thousands while Edu2.FM is close to 1).
The ggpairs()-plot shows distributions of the variables and shows the correlations between the variables. The first four of the variables (Edu2.FM, Labo.FM, Edu.Exp and Life.Exp) have negative skewness and the remaining four variables are positively skewed.
The corrplot() visualizes correlations between variables:
cor(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp GNI
## Edu2.FM 1.000000000 0.009564039 0.59325156 0.5760299 0.43030485
## Labo.FM 0.009564039 1.000000000 0.04732183 -0.1400125 -0.02173971
## Edu.Exp 0.593251562 0.047321827 1.00000000 0.7894392 0.62433940
## Life.Exp 0.576029853 -0.140012504 0.78943917 1.0000000 0.62666411
## GNI 0.430304846 -0.021739705 0.62433940 0.6266641 1.00000000
## Mat.Mor -0.660931770 0.240461075 -0.73570257 -0.8571684 -0.49516234
## Ado.Birth -0.529418415 0.120158862 -0.70356489 -0.7291774 -0.55656208
## Parli.F 0.078635285 0.250232608 0.20608156 0.1700863 0.08920818
## Mat.Mor Ado.Birth Parli.F
## Edu2.FM -0.6609318 -0.5294184 0.07863528
## Labo.FM 0.2404611 0.1201589 0.25023261
## Edu.Exp -0.7357026 -0.7035649 0.20608156
## Life.Exp -0.8571684 -0.7291774 0.17008631
## GNI -0.4951623 -0.5565621 0.08920818
## Mat.Mor 1.0000000 0.7586615 -0.08944000
## Ado.Birth 0.7586615 1.0000000 -0.07087810
## Parli.F -0.0894400 -0.0708781 1.00000000
res1 <- cor.mtest(human, conf.level = .95)
cor(human) %>% corrplot(p.mat = res1$p, method = "color", type = "upper",
sig.level = c(.001, .01, .05), pch.cex = .9,
insig = "label_sig", pch.col = "white", order = "AOE")
Blue color means positive correlation and red negative correlation. Lighter colors mean weaker association and darker colors stronger.
Next we will conduct a principal component analysis (PCA) on the unstandardized human data and show the variability captured by the principal components. A biplot will be drawn to display the observations along the first two principal components.
# create and print out a summary of pca_human
pca_human <- prcomp(human)
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
# rounded percentages of variance captured by each PC
pca_pr <- 100 * round(s$importance[2, ], digits = 3)
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
# create object pc_lab to be used as axis labels
pc_lab<-paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
Let us standardize the data and then perform the PCA again:
# standardize the variables
human_std <- scale(human)
# print out summaries of the standardized variables
summary(human_std)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
# create and print out a summary of pca_human
pca_human <- prcomp(human_std)
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
# rounded percentages of variance captured by each PC
pca_pr <- 100 * round(s$importance[2, ], digits = 3)
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
# create object pc_lab to be used as axis labels
pc_lab<-paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human, cex = c(0.6, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
Standardizing the data makes a big difference. The results from the unstandardized and standardized data are very different: with unstandardized variables PC1 captured essentially all of the variation, whereas with standardized variables PC1 captures 53.6% (PC2 16.2% and PC3 9.6%). This happens because PCA is sensitive to the scales of the variables: the unstandardized analysis is dominated by GNI, whose variance is orders of magnitude larger than that of the other variables.
The variable Parli.F (the share of parliamentary seats held by women) has strong positive correlation with the variable Labo.FM (the ratio of labour force participation of females and males in each country). The variable Ado.Birth (the adolescent birth rate) is positively correlated with PC1.
PCA reduces the dimensionality of a dataset into components that explain its variance. The variable arrows can be interpreted so that PC1 is formed by Mat.Mor, Ado.Birth, Edu.Exp, Edu2.FM, Life.Exp and GNI, while PC2 is formed by Parli.F and Labo.FM. PC1 explains 53.6% of the variance in the data and PC2 explains 16.2%.
I picked the variables “breakfast”, “evening”, “sex”, “sugar”, “where” and “lunch” to form my sub-dataset. Here is some information about the structure and dimensions of the data, along with some visualizations:
library(FactoMineR)
library(tidyr)
library(dplyr)
library(ggplot2)
data("tea")
tea_mca <- tea[, c("breakfast", "evening", "sex", "sugar", "where", "lunch")]
summary(tea_mca)
## breakfast evening sex sugar
## breakfast :144 evening :103 F:178 No.sugar:155
## Not.breakfast:156 Not.evening:197 M:122 sugar :145
##
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
str(tea_mca)
## 'data.frame': 300 obs. of 6 variables:
## $ breakfast: Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
gather(tea_mca) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
The dataset tea_mca consists of 300 observations of 6 variables.
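Under the hood, MCA operates on the complete indicator (one-hot) matrix of the factors, with one column per category level. A minimal base-R sketch, using a hypothetical mini data frame in place of tea_mca:

```r
# Hypothetical two-variable stand-in for tea_mca
d <- data.frame(
  sugar = factor(c("sugar", "No.sugar", "sugar")),
  sex   = factor(c("F", "M", "F"))
)
# Complete indicator matrix: no reference level is dropped, so each row
# has exactly one 1 per variable.
ind <- model.matrix(~ sugar + sex - 1, d,
                    contrasts.arg = lapply(d, contrasts, contrasts = FALSE))
ind
```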
mca <- MCA(tea_mca, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_mca, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.216 0.198 0.178 0.171 0.154 0.132
## % of var. 18.552 16.987 15.288 14.679 13.222 11.317
## Cumulative % of var. 18.552 35.538 50.827 65.506 78.727 90.045
## Dim.7
## Variance 0.116
## % of var. 9.955
## Cumulative % of var. 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2
## 1 | 0.408 0.257 0.206 | -0.315 0.167 0.122 |
## 2 | -0.567 0.495 0.487 | -0.272 0.124 0.112 |
## 3 | -0.095 0.014 0.010 | 0.194 0.063 0.043 |
## 4 | 0.601 0.556 0.460 | -0.364 0.223 0.169 |
## 5 | 0.203 0.063 0.040 | 0.089 0.013 0.008 |
## 6 | 0.116 0.021 0.018 | -0.475 0.379 0.296 |
## 7 | -0.077 0.009 0.007 | -0.426 0.305 0.230 |
## 8 | -0.095 0.014 0.010 | 0.194 0.063 0.043 |
## 9 | -0.371 0.212 0.117 | -0.221 0.082 0.042 |
## 10 | -0.091 0.013 0.006 | 0.294 0.145 0.062 |
## Dim.3 ctr cos2
## 1 0.447 0.373 0.246 |
## 2 0.028 0.001 0.001 |
## 3 -0.740 1.022 0.632 |
## 4 -0.153 0.044 0.030 |
## 5 0.218 0.089 0.046 |
## 6 -0.214 0.086 0.060 |
## 7 0.386 0.278 0.188 |
## 8 -0.740 1.022 0.632 |
## 9 0.712 0.948 0.434 |
## 10 0.544 0.554 0.211 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## breakfast | -0.279 2.880 0.072 -4.637 | 0.068 0.187
## Not.breakfast | 0.258 2.659 0.072 4.637 | -0.063 0.173
## evening | 0.513 6.948 0.137 6.410 | 0.903 23.532
## Not.evening | -0.268 3.633 0.137 -6.410 | -0.472 12.303
## F | -0.557 14.151 0.452 -11.624 | 0.167 1.396
## M | 0.812 20.646 0.452 11.624 | -0.244 2.036
## No.sugar | -0.654 17.043 0.458 -11.701 | -0.143 0.891
## sugar | 0.700 18.218 0.458 11.701 | 0.153 0.953
## chain store | 0.160 1.259 0.045 3.686 | -0.046 0.112
## chain store+tea shop | -0.661 8.738 0.153 -6.771 | 0.502 5.513
## cos2 v.test Dim.3 ctr cos2 v.test
## breakfast 0.004 1.131 | 0.790 28.019 0.577 13.131 |
## Not.breakfast 0.004 -1.131 | -0.730 25.864 0.577 -13.131 |
## evening 0.426 11.287 | -0.279 2.500 0.041 -3.490 |
## Not.evening 0.426 -11.287 | 0.146 1.307 0.041 3.490 |
## F 0.041 3.493 | -0.368 7.526 0.198 -7.695 |
## M 0.041 -3.493 | 0.538 10.981 0.198 7.695 |
## No.sugar 0.022 -2.561 | -0.075 0.270 0.006 -1.338 |
## sugar 0.022 2.561 | 0.080 0.289 0.006 1.338 |
## chain store 0.004 -1.050 | -0.327 6.398 0.190 -7.541 |
## chain store+tea shop 0.089 5.147 | 0.500 6.081 0.088 5.128 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.072 0.004 0.577 |
## evening | 0.137 0.426 0.041 |
## sex | 0.452 0.041 0.198 |
## sugar | 0.458 0.022 0.006 |
## where | 0.178 0.170 0.196 |
## lunch | 0.001 0.526 0.052 |
The summary of the MCA lists the explained variances: the first dimension explains 18.55% of the variance, the second 16.99% and the third 15.29%.
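The percentage row can be reproduced directly from the eigenvalues (the vector below is copied from the summary output; small rounding differences are expected):

```r
# Eigenvalues of the seven MCA dimensions, from the summary above
eig <- c(0.216, 0.198, 0.178, 0.171, 0.154, 0.132, 0.116)
pct <- 100 * eig / sum(eig)
round(pct, 2)
# With Q = 6 variables and J = 13 category levels in total, the MCA
# eigenvalues sum to (J - Q) / Q = 7/6, which matches sum(eig) up to
# the rounding of the printed values.
```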
plot(mca, invisible=c("ind"), habillage="quali")
This week’s tasks focus on the analysis of longitudinal data. The analysis is conducted on two datasets following the approach presented in Chapters 8 and 9 of the book “Multivariate Analysis for the Behavioral Sciences”.
The datasets have been wrangled from wide form to long form before analysis. The R script file that created the data files can be viewed here.
rm(list=ls())
setwd("~/Library/Mobile Documents/com~apple~CloudDocs/Documents/GitHub/IODS-project/data")
library(dplyr)
library(stringr)
library(ggplot2)
library(GGally)
library(tidyr)
library(readr)
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
## expand
RATSL <- read.table("RATSL.txt")
BPRSL <- read.table("BPRSL.txt")
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)
The RATS data consists of three groups of rats, 16 rats in total. The groups were put on different diets and the rats' weights (in grams) were measured weekly over a nine-week period. The dimensions (rows and columns) of the RATS data are:
## [1] 176 5
Let us visualize the weight increase of the rats by groups:
From the graph we can see that the rat weights seem to increase in all groups over time. There is also one big rat in group 2 that is much heavier than the other rats in that group. In groups 2 and 3 the weights seem to have increased more than in group 1. Let us standardize the values to make the groups easier to compare, and plot the standardized data:
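The within-time standardization itself can be sketched in base R (assuming RATSL has columns Weight and Time; a hypothetical mini data frame is used here in place of the real data):

```r
# Hypothetical mini version of RATSL: weights at two time points
RATSL_demo <- data.frame(
  Time   = rep(c(1, 8), each = 3),
  Weight = c(240, 250, 260, 300, 310, 330)
)
# Standardize Weight within each time point: (x - mean) / sd,
# so every time point has mean 0 and standard deviation 1
RATSL_demo$stdweight <- ave(RATSL_demo$Weight, RATSL_demo$Time,
                            FUN = function(x) (x - mean(x)) / sd(x))
```

The course itself uses a dplyr `group_by(Time) %>% mutate(...)` pipeline; `ave()` is the base-R equivalent used here to keep the sketch dependency-free.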
There do not seem to be huge differences in weight change between the groups. The same information can be seen from a different visualization below, where the plot is built from group means and standard errors:
Here we can see that group 2 varies the most (remember that this group contains the one rat that is much bigger than the others) and group 1 has the least variation. Let us remove that one big rat from group 2 and perform a summary measure analysis with boxplots:
All the groups have different means. Next we conduct an analysis of variance (ANOVA) to study whether the differences between the groups are statistically significant. In this test the null hypothesis is that the group means do not differ.
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 188182 94091 580.4 7.07e-12 ***
## Residuals 11 1783 162
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to the ANOVA test the p-value is close to zero. We can therefore reject the null hypothesis and infer that the group means differ.
The BPRS data consists of 40 male subjects divided into two groups that received different treatments. All subjects were rated weekly on the BPRS (Brief Psychiatric Rating Scale) over an eight-week treatment period. The scale is used to evaluate symptoms of schizophrenia. The dimensions of the dataset (rows and columns) are
## [1] 360 5
Here is a plot of the two groups showing changes among individuals through eight weeks:
This plot is not very informative: it does not tell much about the differences between the treatments, although in treatment group 1 the values seem to decrease more consistently over time. Let us conduct a basic linear regression to study the data further:
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
The results indicate that, according to this model, the treatment is not statistically significantly associated with the development of the BPRS values. The week variable, on the other hand, has a statistically significant effect. Next, let us fit a linear mixed effects model that takes into account the fact that our observations are not independent.
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
In this model the standard errors of treatment and week are smaller than in the previous model. Let us test which model is better with ANOVA:
## Data: BPRSL
## Models:
## BPRS_reg: bprs ~ week + treatment
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_reg 4 2837.3 2852.9 -1414.7 2829.3
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7 90.624 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA comparison indicates that the linear mixed effects model is the better of the two models.
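The nested-model comparison via `anova()` can be sketched with two ordinary linear models on the built-in mtcars data (a stand-in for the lm-versus-lmer comparison above; with an lmerMod object the test becomes a likelihood-ratio chi-squared test instead of an F-test):

```r
# Two nested models: does adding hp improve on the wt-only model?
m1 <- lm(mpg ~ wt, data = mtcars)       # simpler model
m2 <- lm(mpg ~ wt + hp, data = mtcars)  # richer model
cmp <- anova(m1, m2)                    # tests the added term
cmp[["Pr(>F)"]][2]                      # p-value of the comparison
```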
Let us plot the model and compare it with original data:
From the plot we can see that the fitted model captures the group-wise treatment trends fairly well. The first group shows a clearer decrease in BPRS values in the fitted version, just as in the original data.